For this mini project, the dataset was taken from Kaggle. The Indian Premier League (IPL) is a professional Twenty20 cricket league in India, typically held between March and May each year. It features eight to ten teams representing various cities or states across India. Established by the Board of Control for Cricket in India (BCCI) in 2007, the IPL is the world’s most-attended cricket league and has significant brand value.
rm(list=ls())
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tibble)
library(ggplot2)
library(readr)
library(sf)
## Warning: package 'sf' was built under R version 4.3.3
## Linking to GEOS 3.11.2, GDAL 3.8.2, PROJ 9.3.1; sf_use_s2() is TRUE
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.3.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(caret)
## Loading required package: lattice
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
matches <- read_csv("C:/Users/Administrator/Desktop/SUMMER_A_2024/Data_Visualisations/Siri_kesidi_IPL_Decoding_Data_viz_mini_2/Siri_MP_2_data/matches.csv")
## Rows: 756 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (13): city, date, team1, team2, toss_winner, toss_decision, result, winn...
## dbl (5): id, season, dl_applied, win_by_runs, win_by_wickets
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
deliveries <- read_csv("C:/Users/Administrator/Desktop/SUMMER_A_2024/Data_Visualisations/Siri_kesidi_IPL_Decoding_Data_viz_mini_2/Siri_MP_2_data/deliveries.csv")
## Rows: 179078 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (8): batting_team, bowling_team, batsman, non_striker, bowler, player_d...
## dbl (13): match_id, inning, over, ball, is_super_over, wide_runs, bye_runs, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data come in two tables. The matches table (756 rows, 18 columns) holds one row per match: match ID, season, city, date, the two teams, the toss winner and toss decision, the result type, whether the Duckworth-Lewis method was applied (typically after weather interruptions), the winner, the victory margin in runs or wickets, the player of the match, the venue, and the umpires. Most of these columns are read as character; only id, season, dl_applied, win_by_runs, and win_by_wickets are numeric. The deliveries table (179,078 rows, 21 columns) holds one row per ball: innings, batting and bowling teams, over, ball, batsman, non-striker, bowler, runs scored, extra runs by type, and dismissal details.
summary(matches)
## id season city date
## Min. : 1.0 Min. :2008 Length:756 Length:756
## 1st Qu.: 189.8 1st Qu.:2011 Class :character Class :character
## Median : 378.5 Median :2013 Mode :character Mode :character
## Mean : 1792.2 Mean :2013
## 3rd Qu.: 567.2 3rd Qu.:2016
## Max. :11415.0 Max. :2019
## team1 team2 toss_winner toss_decision
## Length:756 Length:756 Length:756 Length:756
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## result dl_applied winner win_by_runs
## Length:756 Min. :0.00000 Length:756 Min. : 0.00
## Class :character 1st Qu.:0.00000 Class :character 1st Qu.: 0.00
## Mode :character Median :0.00000 Mode :character Median : 0.00
## Mean :0.02513 Mean : 13.28
## 3rd Qu.:0.00000 3rd Qu.: 19.00
## Max. :1.00000 Max. :146.00
## win_by_wickets player_of_match venue umpire1
## Min. : 0.000 Length:756 Length:756 Length:756
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 4.000 Mode :character Mode :character Mode :character
## Mean : 3.351
## 3rd Qu.: 6.000
## Max. :10.000
## umpire2 umpire3
## Length:756 Length:756
## Class :character Class :character
## Mode :character Mode :character
##
##
##
summary(deliveries)
## match_id inning batting_team bowling_team
## Min. : 1 Min. :1.000 Length:179078 Length:179078
## 1st Qu.: 190 1st Qu.:1.000 Class :character Class :character
## Median : 379 Median :1.000 Mode :character Mode :character
## Mean : 1802 Mean :1.483
## 3rd Qu.: 567 3rd Qu.:2.000
## Max. :11415 Max. :5.000
## over ball batsman non_striker
## Min. : 1.00 Min. :1.000 Length:179078 Length:179078
## 1st Qu.: 5.00 1st Qu.:2.000 Class :character Class :character
## Median :10.00 Median :4.000 Mode :character Mode :character
## Mean :10.16 Mean :3.616
## 3rd Qu.:15.00 3rd Qu.:5.000
## Max. :20.00 Max. :9.000
## bowler is_super_over wide_runs bye_runs
## Length:179078 Min. :0.0000000 Min. :0.00000 Min. :0.000000
## Class :character 1st Qu.:0.0000000 1st Qu.:0.00000 1st Qu.:0.000000
## Mode :character Median :0.0000000 Median :0.00000 Median :0.000000
## Mean :0.0004523 Mean :0.03672 Mean :0.004936
## 3rd Qu.:0.0000000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :5.00000 Max. :4.000000
## legbye_runs noball_runs penalty_runs batsman_runs
## Min. :0.00000 Min. :0.000000 Min. :0.0e+00 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:0.0e+00 1st Qu.:0.000
## Median :0.00000 Median :0.000000 Median :0.0e+00 Median :1.000
## Mean :0.02114 Mean :0.004183 Mean :5.6e-05 Mean :1.247
## 3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:0.0e+00 3rd Qu.:1.000
## Max. :5.00000 Max. :5.000000 Max. :5.0e+00 Max. :7.000
## extra_runs total_runs player_dismissed dismissal_kind
## Min. :0.00000 Min. : 0.000 Length:179078 Length:179078
## 1st Qu.:0.00000 1st Qu.: 0.000 Class :character Class :character
## Median :0.00000 Median : 1.000 Mode :character Mode :character
## Mean :0.06703 Mean : 1.314
## 3rd Qu.:0.00000 3rd Qu.: 1.000
## Max. :7.00000 Max. :10.000
## fielder
## Length:179078
## Class :character
## Mode :character
##
##
##
str(deliveries)
## spc_tbl_ [179,078 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ match_id : num [1:179078] 1 1 1 1 1 1 1 1 1 1 ...
## $ inning : num [1:179078] 1 1 1 1 1 1 1 1 1 1 ...
## $ batting_team : chr [1:179078] "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" "Sunrisers Hyderabad" ...
## $ bowling_team : chr [1:179078] "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" "Royal Challengers Bangalore" ...
## $ over : num [1:179078] 1 1 1 1 1 1 1 2 2 2 ...
## $ ball : num [1:179078] 1 2 3 4 5 6 7 1 2 3 ...
## $ batsman : chr [1:179078] "DA Warner" "DA Warner" "DA Warner" "DA Warner" ...
## $ non_striker : chr [1:179078] "S Dhawan" "S Dhawan" "S Dhawan" "S Dhawan" ...
## $ bowler : chr [1:179078] "TS Mills" "TS Mills" "TS Mills" "TS Mills" ...
## $ is_super_over : num [1:179078] 0 0 0 0 0 0 0 0 0 0 ...
## $ wide_runs : num [1:179078] 0 0 0 0 2 0 0 0 0 0 ...
## $ bye_runs : num [1:179078] 0 0 0 0 0 0 0 0 0 0 ...
## $ legbye_runs : num [1:179078] 0 0 0 0 0 0 1 0 0 0 ...
## $ noball_runs : num [1:179078] 0 0 0 0 0 0 0 0 0 1 ...
## $ penalty_runs : num [1:179078] 0 0 0 0 0 0 0 0 0 0 ...
## $ batsman_runs : num [1:179078] 0 0 4 0 0 0 0 1 4 0 ...
## $ extra_runs : num [1:179078] 0 0 0 0 2 0 1 0 0 1 ...
## $ total_runs : num [1:179078] 0 0 4 0 2 0 1 1 4 1 ...
## $ player_dismissed: chr [1:179078] NA NA NA NA ...
## $ dismissal_kind : chr [1:179078] NA NA NA NA ...
## $ fielder : chr [1:179078] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. match_id = col_double(),
## .. inning = col_double(),
## .. batting_team = col_character(),
## .. bowling_team = col_character(),
## .. over = col_double(),
## .. ball = col_double(),
## .. batsman = col_character(),
## .. non_striker = col_character(),
## .. bowler = col_character(),
## .. is_super_over = col_double(),
## .. wide_runs = col_double(),
## .. bye_runs = col_double(),
## .. legbye_runs = col_double(),
## .. noball_runs = col_double(),
## .. penalty_runs = col_double(),
## .. batsman_runs = col_double(),
## .. extra_runs = col_double(),
## .. total_runs = col_double(),
## .. player_dismissed = col_character(),
## .. dismissal_kind = col_character(),
## .. fielder = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
str(matches)
## spc_tbl_ [756 × 18] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:756] 1 2 3 4 5 6 7 8 9 10 ...
## $ season : num [1:756] 2017 2017 2017 2017 2017 ...
## $ city : chr [1:756] "Hyderabad" "Pune" "Rajkot" "Indore" ...
## $ date : chr [1:756] "2017-04-05" "2017-04-06" "2017-04-07" "2017-04-08" ...
## $ team1 : chr [1:756] "Sunrisers Hyderabad" "Mumbai Indians" "Gujarat Lions" "Rising Pune Supergiant" ...
## $ team2 : chr [1:756] "Royal Challengers Bangalore" "Rising Pune Supergiant" "Kolkata Knight Riders" "Kings XI Punjab" ...
## $ toss_winner : chr [1:756] "Royal Challengers Bangalore" "Rising Pune Supergiant" "Kolkata Knight Riders" "Kings XI Punjab" ...
## $ toss_decision : chr [1:756] "field" "field" "field" "field" ...
## $ result : chr [1:756] "normal" "normal" "normal" "normal" ...
## $ dl_applied : num [1:756] 0 0 0 0 0 0 0 0 0 0 ...
## $ winner : chr [1:756] "Sunrisers Hyderabad" "Rising Pune Supergiant" "Kolkata Knight Riders" "Kings XI Punjab" ...
## $ win_by_runs : num [1:756] 35 0 0 0 15 0 0 0 97 0 ...
## $ win_by_wickets : num [1:756] 0 7 10 6 0 9 4 8 0 4 ...
## $ player_of_match: chr [1:756] "Yuvraj Singh" "SPD Smith" "CA Lynn" "GJ Maxwell" ...
## $ venue : chr [1:756] "Rajiv Gandhi International Stadium, Uppal" "Maharashtra Cricket Association Stadium" "Saurashtra Cricket Association Stadium" "Holkar Cricket Stadium" ...
## $ umpire1 : chr [1:756] "AY Dandekar" "A Nand Kishore" "Nitin Menon" "AK Chaudhary" ...
## $ umpire2 : chr [1:756] "NJ Llong" "S Ravi" "CK Nandan" "C Shamshuddin" ...
## $ umpire3 : chr [1:756] NA NA NA NA ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. season = col_double(),
## .. city = col_character(),
## .. date = col_character(),
## .. team1 = col_character(),
## .. team2 = col_character(),
## .. toss_winner = col_character(),
## .. toss_decision = col_character(),
## .. result = col_character(),
## .. dl_applied = col_double(),
## .. winner = col_character(),
## .. win_by_runs = col_double(),
## .. win_by_wickets = col_double(),
## .. player_of_match = col_character(),
## .. venue = col_character(),
## .. umpire1 = col_character(),
## .. umpire2 = col_character(),
## .. umpire3 = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
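The date column is read as character, and the values printed in this report use at least two layouts ("2017-04-05" here and "07/04/18" further down). A minimal sketch of one way to parse both into Date values, assuming the second layout is day/month/two-digit year. The helper name parse_mixed_date is hypothetical and not part of the original analysis:

```r
# Hypothetical helper: parse dates stored as either "YYYY-MM-DD" or "DD/MM/YY".
parse_mixed_date <- function(x) {
  iso   <- as.Date(x, format = "%Y-%m-%d")  # NA where the layout does not fit
  short <- as.Date(x, format = "%d/%m/%y")  # assumed day/month/two-digit year
  missing <- is.na(iso)
  iso[missing] <- short[missing]            # fill the gaps from the second parse
  iso
}

dates <- parse_mixed_date(c("2017-04-05", "07/04/18"))
print(dates)
```

Applied to the real data this would be something like matches$date <- parse_mixed_date(matches$date); any value that fits neither layout stays NA and can be inspected separately.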
# Calculate total runs per over per team
team_runs_per_over <- deliveries %>%
  group_by(batting_team, over) %>%
  summarise(total_runs = sum(total_runs)) %>%
  ungroup()
## `summarise()` has grouped output by 'batting_team'. You can override using the
## `.groups` argument.
team_runs_per_over
## # A tibble: 300 × 3
## batting_team over total_runs
## <chr> <dbl> <dbl>
## 1 Chennai Super Kings 1 870
## 2 Chennai Super Kings 2 1116
## 3 Chennai Super Kings 3 1293
## 4 Chennai Super Kings 4 1354
## 5 Chennai Super Kings 5 1423
## 6 Chennai Super Kings 6 1447
## 7 Chennai Super Kings 7 1159
## 8 Chennai Super Kings 8 1186
## 9 Chennai Super Kings 9 1246
## 10 Chennai Super Kings 10 1165
## # ℹ 290 more rows
# Create a static line plot using ggplot2 with facets
line_plot <- ggplot(team_runs_per_over, aes(x = over, y = total_runs, color = batting_team)) +
  geom_line() +
  facet_wrap(~ batting_team, scales = "free_y") +
  labs(title = "Total Runs Scored per Over by Teams",
       x = "Over",
       y = "Total Runs") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Convert the static line plot to an interactive plot using plotly
interactive_line_plot <- ggplotly(line_plot)
# Save the interactive plot as a self-contained HTML file
htmlwidgets::saveWidget(interactive_line_plot, "team_runs_per_over_line_plot.html")
interactive_line_plot
The chart I originally planned for this assignment was a static line plot of the total runs scored per over by each team, with one facet per batting team. To prepare the data, I grouped the deliveries dataset by batting team and over and summed total_runs for each combination; this grouping and aggregation was the only preparation required. The totals are pooled across all seasons, so teams that have played more matches have larger totals, which is why each facet uses its own y-axis (scales = "free_y").
The plots illustrate each team's scoring pattern over the course of an innings, highlighting phases of aggressive scoring or slow accumulation. Faceting makes the shape of each team's curve easy to compare, although absolute levels are not directly comparable because the y-axes differ. The main difficulties were keeping the plot readable with many facets and choosing a clear color scheme. A further step would be to analyze the distribution of runs scored in each over rather than only the total.
Clear axis labels and a title aid interpretation. Converting the plot with ggplotly() adds interactivity, so viewers can hover over a point to read the exact over and run total. The minimal theme and rotated axis labels improve readability.
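Because the totals above depend on how many innings each team has batted, a per-innings average can make teams easier to compare. The following is a minimal sketch of that idea, not the method used in this report; it runs on a small toy table (hypothetical values) that reuses the column names of deliveries:

```r
library(dplyr)

# Toy stand-in for `deliveries` (hypothetical values, same column names)
toy <- tibble::tibble(
  match_id     = c(1, 1, 1, 1, 2, 2),
  inning       = c(1, 1, 1, 1, 1, 1),
  batting_team = c("A", "A", "A", "A", "A", "A"),
  over         = c(1, 1, 2, 2, 1, 1),
  total_runs   = c(4, 2, 1, 0, 6, 6)
)

avg_runs_per_over <- toy %>%
  # First total the runs in each over of each innings ...
  group_by(batting_team, match_id, inning, over) %>%
  summarise(over_runs = sum(total_runs), .groups = "drop") %>%
  # ... then average those over totals across innings
  group_by(batting_team, over) %>%
  summarise(mean_runs = mean(over_runs), innings = n(), .groups = "drop")

print(avg_runs_per_over)
```

With mean_runs on the y-axis, the facets could share a common scale, at the cost of hiding how many innings each average is based on (kept here in the innings column).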
indian_cities <- read_csv("C:/Users/Administrator/Desktop/SUMMER_A_2024/Data_Visualisations/Siri_kesidi_IPL_Decoding_Data_viz_mini_2/Siri_MP_2_data/Indian_cities.csv")
## Rows: 1267 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): State, District, City, Population, Area (in km^2)
## dbl (2): Latitude, Longitude
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Merge IPL dataset with geographical data
matches_india <- matches %>%
  left_join(indian_cities, by = c("city" = "District"))
## Warning in left_join(., indian_cities, by = c(city = "District")): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 475 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
# Plot matches on the map using leaflet
map <- leaflet(data = matches_india) %>%
  addTiles() %>%
  addMarkers(~Longitude, ~Latitude, popup = ~venue)
## Warning in validateCoords(lng, lat, funcName): Data contains 119 rows with
## either missing or invalid lat/lon values and will be ignored
# Display the map
map
# Save the interactive plot as a self-contained HTML file
htmlwidgets::saveWidget(map, "venue_matches_played.html")
The chart I originally planned for this assignment was a spatial visualization of IPL match locations drawn on a shapefile of India, but I could not find a suitable India.shp file, so I used an interactive leaflet map instead. Preparing the data involved three steps: loading geographical data for Indian cities, including latitude and longitude; merging the IPL matches with that data by matching the match city to the District column; and plotting the matches on a map with the leaflet package.
The map tells the story of where IPL matches were held across India and how they are distributed geographically, showing the cities where IPL teams play most often. The join raised two problems. District names are not unique in the cities file, so dplyr warned about a many-to-many relationship and some matches were duplicated across several coordinate rows. In addition, 119 rows had missing or invalid coordinates and were left off the map, most likely because the city name had no match in the cities file. Additional approaches could include clustering to identify regions with high match density, or overlaying demographic data to estimate the audience reach of IPL matches.
The design follows from the data: a map is the natural choice for location data, the tile background gives geographic context, and the interactivity lets users explore individual matches. Popup markers show the venue name when clicked, adding detail without cluttering the map.
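One way to avoid the many-to-many warning is to reduce the cities table to a single coordinate row per city before joining, and then list the cities that still have no coordinates. This is a sketch under assumptions: it joins on City rather than District, averages duplicate coordinates, and runs on toy tables with approximate, illustrative coordinates rather than the real files:

```r
library(dplyr)

# Toy stand-ins (hypothetical rows) for `matches` and `indian_cities`
matches_toy <- tibble::tibble(
  id   = 1:3,
  city = c("Pune", "Pune", "Mohali")
)
cities_toy <- tibble::tibble(
  City      = c("Pune", "Pune", "Indore"),
  Latitude  = c(18.52, 18.53, 22.72),
  Longitude = c(73.86, 73.85, 75.86)
)

# Keep a single coordinate row per city before joining
city_coords <- cities_toy %>%
  group_by(City) %>%
  summarise(Latitude = mean(Latitude),
            Longitude = mean(Longitude),
            .groups = "drop")

# Each match now matches at most one row, so no rows are duplicated
matches_geo <- matches_toy %>%
  left_join(city_coords, by = c("city" = "City"))

# Cities with no coordinates, to be fixed by hand or with a lookup table
unmatched <- matches_geo %>%
  filter(is.na(Latitude)) %>%
  distinct(city)

print(unmatched)
```

Checking unmatched before plotting makes it explicit which venues are missing from the map, instead of relying on the leaflet warning.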
colSums(is.na(matches))
## id season city date team1
## 0 0 7 0 0
## team2 toss_winner toss_decision result dl_applied
## 0 0 0 0 0
## winner win_by_runs win_by_wickets player_of_match venue
## 4 0 0 4 0
## umpire1 umpire2 umpire3
## 2 2 637
df_clean <- na.omit(matches)
str(df_clean)
## tibble [118 × 18] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:118] 7894 7895 7896 7897 7898 ...
## $ season : num [1:118] 2018 2018 2018 2018 2018 ...
## $ city : chr [1:118] "Mumbai" "Mohali" "Kolkata" "Hyderabad" ...
## $ date : chr [1:118] "07/04/18" "08/04/18" "08/04/18" "09/04/18" ...
## $ team1 : chr [1:118] "Mumbai Indians" "Delhi Daredevils" "Royal Challengers Bangalore" "Rajasthan Royals" ...
## $ team2 : chr [1:118] "Chennai Super Kings" "Kings XI Punjab" "Kolkata Knight Riders" "Sunrisers Hyderabad" ...
## $ toss_winner : chr [1:118] "Chennai Super Kings" "Kings XI Punjab" "Kolkata Knight Riders" "Sunrisers Hyderabad" ...
## $ toss_decision : chr [1:118] "field" "field" "field" "field" ...
## $ result : chr [1:118] "normal" "normal" "normal" "normal" ...
## $ dl_applied : num [1:118] 0 0 0 0 0 1 0 0 0 0 ...
## $ winner : chr [1:118] "Chennai Super Kings" "Kings XI Punjab" "Kolkata Knight Riders" "Sunrisers Hyderabad" ...
## $ win_by_runs : num [1:118] 0 0 0 0 0 10 0 0 0 0 ...
## $ win_by_wickets : num [1:118] 1 6 4 9 5 0 1 4 7 5 ...
## $ player_of_match: chr [1:118] "DJ Bravo" "KL Rahul" "SP Narine" "S Dhawan" ...
## $ venue : chr [1:118] "Wankhede Stadium" "Punjab Cricket Association IS Bindra Stadium, Mohali" "Eden Gardens" "Rajiv Gandhi International Stadium, Uppal" ...
## $ umpire1 : chr [1:118] "Chris Gaffaney" "Rod Tucker" "C Shamshuddin" "Nigel Llong" ...
## $ umpire2 : chr [1:118] "A Nanda Kishore" "K Ananthapadmanabhan" "A.D Deshmukh" "Vineet Kulkarni" ...
## $ umpire3 : chr [1:118] "Anil Chaudhary" "Nitin Menon" "S Ravi" "O Nandan" ...
## - attr(*, "na.action")= 'omit' Named int [1:638] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "names")= chr [1:638] "1" "2" "3" "4" ...
# Fit a linear regression model on the complete-case data (df_clean)
lm_model <- lm(win_by_runs ~ win_by_wickets + season, data = df_clean)
# Summary of the linear regression model
summary(lm_model)
##
## Call:
## lm(formula = win_by_runs ~ win_by_wickets + season, data = df_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.864 -9.396 -2.291 4.900 94.136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2939.5158 6610.9573 -0.445 0.657
## win_by_wickets -3.5955 0.5085 -7.070 1.28e-10 ***
## season 1.4677 3.2752 0.448 0.655
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.76 on 115 degrees of freedom
## Multiple R-squared: 0.303, Adjusted R-squared: 0.2909
## F-statistic: 25 on 2 and 115 DF, p-value: 9.677e-10
# Draw the standard diagnostic plots for the model, one plot per page
par(mfrow=c(1,1))
plot(lm_model)
The plan here was to fit a linear regression model predicting win_by_runs from win_by_wickets and season. This requires cleaning the data, ensuring the variables are in the correct format, and checking for missing values. Note that na.omit() removes every row containing any missing value: because umpire3 is missing for 637 of the 756 matches, only 118 matches remain in df_clean.
The summary shows how win_by_wickets and season relate to win_by_runs. The coefficient for win_by_wickets is about -3.6 and highly significant, while season is not significant (p = 0.655); the model explains about 30% of the variance (R-squared 0.303). The negative coefficient is largely structural: a match is won either by runs or by wickets, so whenever one margin is positive the other is zero. The call to plot(lm_model) draws the standard diagnostic plots, such as residuals vs fitted, rather than the coefficients themselves. One difficulty was interpreting the residuals vs fitted plot; another is that coefficients are hard to compare when the predictors are on different scales, which standardizing the variables could help with. Additional approaches could include adding an interaction term between win_by_wickets and season to capture any combined effect.
The summary() and plot() functions give a concise numerical and visual account of the model, and the diagnostic plots help check the model assumptions. Further visualization, such as a coefficient plot with confidence intervals, could make the estimates easier to communicate.
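To keep more rows for modelling, missing values can be checked only in the columns the model actually uses. The following is a minimal sketch on a toy data frame with hypothetical values; the column selection in model_cols is an assumption for illustration:

```r
# Toy stand-in for `matches` (hypothetical rows, same column names)
m <- data.frame(
  win_by_runs    = c(35, 0, 0, 15),
  win_by_wickets = c(0, 7, 10, 0),
  winner         = c("A", "B", NA, "A"),
  umpire3        = c(NA, NA, NA, "X")
)

# na.omit() on the whole table keeps only rows complete in every column
n_all <- nrow(na.omit(m))

# Restrict the completeness check to the columns the model needs
model_cols <- c("win_by_runs", "win_by_wickets", "winner")
m_clean <- m[complete.cases(m[, model_cols]), ]
n_model <- nrow(m_clean)

# The two margins never both exceed zero in the same match
exclusive <- all(m_clean$win_by_runs == 0 | m_clean$win_by_wickets == 0)

print(c(all_columns = n_all, model_columns = n_model))
print(exclusive)
```

The last check illustrates why regressing one margin on the other mostly recovers the structure of the scoring rules rather than a relationship between the two.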
# Convert necessary variables to factors
df_clean$winner <- as.factor(df_clean$winner)
df_clean$city <- as.factor(df_clean$city)
df_clean$team1 <- as.factor(df_clean$team1)
df_clean$team2 <- as.factor(df_clean$team2)
df_clean$toss_winner <- as.factor(df_clean$toss_winner)
df_clean$toss_decision <- as.factor(df_clean$toss_decision)
df_clean$venue <- as.factor(df_clean$venue)
df_clean$umpire1 <- as.factor(df_clean$umpire1)
df_clean$umpire2 <- as.factor(df_clean$umpire2)
df_clean$umpire3 <- as.factor(df_clean$umpire3)
# Split the data into training and testing sets
set.seed(123) # for reproducibility
train_index <- sample(1:nrow(df_clean), 0.7 * nrow(df_clean))
train_data <- df_clean[train_index, ]
test_data <- df_clean[-train_index, ]
# Train the Random Forest model
rf_model <- randomForest(winner ~ ., data = train_data)
# Predict the winner
predictions <- predict(rf_model, newdata = test_data)
# Evaluate the model
conf_matrix <- confusionMatrix(predictions, test_data$winner)
print("Confusion Matrix:")
## [1] "Confusion Matrix:"
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Chennai Super Kings Delhi Capitals
## Chennai Super Kings 4 0
## Delhi Capitals 1 0
## Delhi Daredevils 0 0
## Kings XI Punjab 0 0
## Kolkata Knight Riders 0 0
## Mumbai Indians 1 0
## Rajasthan Royals 0 0
## Royal Challengers Bangalore 1 0
## Sunrisers Hyderabad 1 1
## Reference
## Prediction Delhi Daredevils Kings XI Punjab
## Chennai Super Kings 0 1
## Delhi Capitals 0 0
## Delhi Daredevils 0 0
## Kings XI Punjab 0 2
## Kolkata Knight Riders 0 0
## Mumbai Indians 0 0
## Rajasthan Royals 0 1
## Royal Challengers Bangalore 0 0
## Sunrisers Hyderabad 0 1
## Reference
## Prediction Kolkata Knight Riders Mumbai Indians
## Chennai Super Kings 1 2
## Delhi Capitals 0 0
## Delhi Daredevils 0 0
## Kings XI Punjab 0 0
## Kolkata Knight Riders 3 0
## Mumbai Indians 0 2
## Rajasthan Royals 0 0
## Royal Challengers Bangalore 2 0
## Sunrisers Hyderabad 0 0
## Reference
## Prediction Rajasthan Royals Royal Challengers Bangalore
## Chennai Super Kings 0 0
## Delhi Capitals 0 0
## Delhi Daredevils 0 0
## Kings XI Punjab 0 1
## Kolkata Knight Riders 0 0
## Mumbai Indians 0 0
## Rajasthan Royals 3 0
## Royal Challengers Bangalore 0 2
## Sunrisers Hyderabad 1 1
## Reference
## Prediction Sunrisers Hyderabad
## Chennai Super Kings 0
## Delhi Capitals 0
## Delhi Daredevils 0
## Kings XI Punjab 0
## Kolkata Knight Riders 1
## Mumbai Indians 0
## Rajasthan Royals 1
## Royal Challengers Bangalore 0
## Sunrisers Hyderabad 2
##
## Overall Statistics
##
## Accuracy : 0.5
## 95% CI : (0.3292, 0.6708)
## No Information Rate : 0.2222
## P-Value [Acc > NIR] : 0.0002328
##
## Kappa : 0.4173
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Chennai Super Kings Class: Delhi Capitals
## Sensitivity 0.5000 0.00000
## Specificity 0.8571 0.97143
## Pos Pred Value 0.5000 0.00000
## Neg Pred Value 0.8571 0.97143
## Prevalence 0.2222 0.02778
## Detection Rate 0.1111 0.00000
## Detection Prevalence 0.2222 0.02778
## Balanced Accuracy 0.6786 0.48571
## Class: Delhi Daredevils Class: Kings XI Punjab
## Sensitivity NA 0.40000
## Specificity 1 0.96774
## Pos Pred Value NA 0.66667
## Neg Pred Value NA 0.90909
## Prevalence 0 0.13889
## Detection Rate 0 0.05556
## Detection Prevalence 0 0.08333
## Balanced Accuracy NA 0.68387
## Class: Kolkata Knight Riders Class: Mumbai Indians
## Sensitivity 0.50000 0.50000
## Specificity 0.96667 0.96875
## Pos Pred Value 0.75000 0.66667
## Neg Pred Value 0.90625 0.93939
## Prevalence 0.16667 0.11111
## Detection Rate 0.08333 0.05556
## Detection Prevalence 0.11111 0.08333
## Balanced Accuracy 0.73333 0.73438
## Class: Rajasthan Royals Class: Royal Challengers Bangalore
## Sensitivity 0.75000 0.50000
## Specificity 0.93750 0.90625
## Pos Pred Value 0.60000 0.40000
## Neg Pred Value 0.96774 0.93548
## Prevalence 0.11111 0.11111
## Detection Rate 0.08333 0.05556
## Detection Prevalence 0.13889 0.13889
## Balanced Accuracy 0.84375 0.70312
## Class: Sunrisers Hyderabad
## Sensitivity 0.50000
## Specificity 0.84375
## Pos Pred Value 0.28571
## Neg Pred Value 0.93103
## Prevalence 0.11111
## Detection Rate 0.05556
## Detection Prevalence 0.19444
## Balanced Accuracy 0.67188
# Plot the ROC curve
roc_curve <- roc(test_data$winner, as.numeric(predictions))
## Warning in roc.default(test_data$winner, as.numeric(predictions)): 'response'
## has more than two levels. Consider setting 'levels' explicitly or using
## 'multiclass.roc' instead
## Setting levels: control = Chennai Super Kings, case = Delhi Capitals
## Setting direction: controls < cases
plot(roc_curve, main = "ROC Curve for Random Forest Model")
The confusion matrix helps evaluate the classification model by showing, for each team, how many of the 36 test matches were predicted correctly (the diagonal) and where the errors went (the off-diagonal cells). A ROC plot illustrates the trade-off between sensitivity (true positive rate) and specificity (true negative rate) across classification thresholds.
Together these show how well the model predicts the winning team. Overall accuracy is 0.50, against a no-information rate of 0.22, with a kappa of 0.42, and the per-class statistics show where the model makes its errors. The closer a ROC curve is to the top-left corner, the better the model separates the classes. However, as the pROC warning indicates, roc() expects a two-level response: here it used only Chennai Super Kings as control and Delhi Capitals as case, with the predicted labels converted to numbers, so this curve does not summarize the full nine-class model; multiclass.roc() is the suggested alternative. One difficulty is class imbalance: some teams have very few test matches and Delhi Daredevils has none, which is why its sensitivity is NA. Additional approaches could involve feature engineering to create variables that better capture match dynamics, such as player statistics and team rankings.
In data visualization, clarity and interpretability are key. Plots should be easy to understand and should convey the model’s performance effectively, with appropriate labels, titles, and annotations. The audience matters too: for cricket enthusiasts, one might delve deeper into match-specific insights, while for a general audience, the focus might be on overall model performance.
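For a response with more than two classes, pROC provides multiclass.roc(), which can take class probabilities instead of predicted labels. The following is a minimal sketch on the built-in iris data rather than the IPL matches, so the numbers say nothing about this report's model. It assumes the probability columns are named after the factor levels, which is what predict(..., type = "prob") returns for a randomForest classifier:

```r
library(randomForest)
library(pROC)

set.seed(123)  # for reproducibility
idx   <- sample(nrow(iris), 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

rf <- randomForest(Species ~ ., data = train)

# Class probabilities: one column per class, named after the factor levels
probs <- as.data.frame(predict(rf, newdata = test, type = "prob"))

# Multiclass AUC computed from the probability columns
mc <- multiclass.roc(test$Species, probs)
mc_auc <- as.numeric(auc(mc))
print(mc_auc)
```

With the IPL data, teams that never appear in the test set would probably need to be dropped from the factor levels first, for example with droplevels().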